Abstract

Recently, it has been shown how sampling actions from the predic- 
tive distribution over the optimal action — sometimes called Thompson 
sampling — can be applied to solve sequential adaptive control problems, 
when the optimal policy is known for each possible environment. The 
predictive distribution can then be constructed by a Bayesian superposi- 
tion of the optimal policies weighted by their posterior probability that 
is updated by Bayesian inference and causal calculus. Here we discuss 
three important features of this approach. First, we discuss to what extent
such Thompson sampling can be regarded as a natural consequence of the
Bayesian modeling of policy uncertainty. Second, we show how Thompson
sampling can be used to study interactions between multiple adaptive
agents, thus opening up an avenue of game-theoretic analysis. Third, we
show how Thompson sampling can be applied to infer causal relationships 
when interacting with an environment in a sequential fashion. In sum- 
mary, our results suggest that Thompson sampling might not merely be a 
useful heuristic, but a principled method to address problems of adaptive 
sequential decision-making and causal inference. 

1 Introduction 

In a research paper from 1933, Thompson studied the problem of finding out 
which one of two drugs was better when testing them on a patient population 
under the constraint that as few people as possible should be subjected to the 
inferior drug during the course of testing [1]. Given a current probability
estimate P of one treatment being better than the other, he suggested that it
might be a good idea to adjust the proportions of future test subjects that take
the two drugs to the respective probabilities P and 1 - P. This way one would
not run into the danger of permanently cutting off all future test subjects from 
a potentially superior treatment that so far seems inferior due to statistical 
fluctuations, while only temporarily risking exposure to a potentially inferior 
drug for a smaller proportion of the population. Randomizing actions based on 
the probability that this action is believed to be optimal when faced with an 
unknown environment is now sometimes called Thompson sampling. 
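
Thompson's allocation rule can be illustrated numerically. The following is a minimal sketch, not Thompson's original computation: it assumes binary treatment outcomes and independent uniform Beta(1, 1) priors on the two success rates, and it estimates P by Monte Carlo sampling. The function names `prob_first_is_better` and `allocate` and all parameters are hypothetical.

```python
import random

def prob_first_is_better(s1, f1, s2, f2, n_samples=20000, seed=0):
    """Monte Carlo estimate of P = Pr(drug 1 has the higher success rate).

    s*/f* are the observed successes/failures per drug. Independent uniform
    Beta(1, 1) priors are an assumption of this sketch.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        p1 = rng.betavariate(s1 + 1, f1 + 1)  # posterior draw for drug 1
        p2 = rng.betavariate(s2 + 1, f2 + 1)  # posterior draw for drug 2
        wins += p1 > p2
    return wins / n_samples

def allocate(s1, f1, s2, f2, rng):
    """Assign the next subject to drug 1 with probability P, else to drug 2."""
    p = prob_first_is_better(s1, f1, s2, f2)
    return 1 if rng.random() < p else 2
```

With no data the estimate is close to 1/2, so both drugs are tested equally often; as evidence accumulates, the seemingly inferior drug is assigned less often but never excluded outright.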
Thompson sampling is a form of probability matching. Probability matching
has been extensively studied in both humans and animals when they make
predictions in stochastic environments [H, Q . Rather than consistently predict- 
ing the most likely outcome, experimental subjects tend to randomize their 
predictions based on the probabilities with which the respective events occur. 
When knowing the probabilities, this is clearly a suboptimal strategy. However, 
in the case of Thompson sampling it is important to note that the probabili- 
ties are not known. Nevertheless, one might argue that Thompson sampling is 
a suboptimal strategy, as Thompson's problem can be thought of as a bandit 
problem which is solved optimally by Gittins indices in the case of known
prior probabilities and discounted rewards @. Most studies have therefore
examined Thompson sampling as a heuristic in the context of bandit problems.
Recently, however, it was shown that Thompson sampling can also be applied
to solve a more general class of sequential adaptive control problems,
provided that both an optimal policy and a predictive model are known for each
possible environment [17]. When an environment is drawn randomly from the
set of possible environments, the optimal policy can then be inferred on the fly
by an adaptation process that is driven by actions sampled from the predic- 
tive distribution over the optimal policies. Here we study three characteristic 
features of such generalized Thompson sampling. First, we discuss to what extent
Thompson sampling can be regarded as a natural consequence of a Bayesian
treatment of policy uncertainty. Second, we study convergence behavior when 
two adaptive Thompson sampling agents are coupled in a sequential fashion. 
Third, we show how this approach can be extended naturally to address the 
problem of causal induction when interacting with an unknown environment. 

The paper is structured as follows. In Section 2 we clarify the problem 
statement and recapitulate the main result of [17]. In Section 3 we analyze
the uncertainty faced by agents that are unable to compute the single best 
policy. In Section 4 we study interactions that arise when coupling two adaptive 
agents that employ Thompson sampling to determine their actions. In Section 5 
we investigate how this approach can be applied to adaptive agents that need 
to discover the causal structure of their environment. Finally, we discuss in 
what sense Thompson sampling might provide a principled solution to adaptive 
decision-making problems. 



2 Problem Statement 
2.1 Preliminaries 

We restrict the exposition to the case of discrete time with discrete stochastic
observations and actions. Let $\mathcal{O}$ and $\mathcal{A}$ be two finite sets, the first being the set of
observations and the second being the set of actions. We use $a_{\le t} := a_1 a_2 \ldots a_t$
and $a_{<t} := a_1 a_2 \ldots a_{t-1}$ to simplify the notation of strings. We define the set
of interactions as $\mathcal{Z} := \mathcal{A} \times \mathcal{O}$. The set of interaction strings of length $t \ge 0$
is denoted by $\mathcal{Z}^t$. The set of all finite interaction strings is
$\mathcal{Z}^* := \bigcup_{t \ge 0} \mathcal{Z}^t$ and the
set of infinite strings is $\mathcal{Z}^\infty := \{w : w = a_1 o_1 a_2 o_2 \ldots\}$. The interaction string
of length 0 is denoted by $\epsilon$.

Agents and environments are formalized as I/O systems. An I/O system
Pr is a probability distribution over interaction sequences $\mathcal{Z}^\infty$. Pr is uniquely
determined by the conditional probabilities

$$\Pr(a_t \mid a_{<t}, o_{<t}), \qquad \Pr(o_t \mid a_{\le t}, o_{<t})$$

for each $a_1 o_1 \ldots a_{t-1} o_{t-1} a_t \in \mathcal{Z}^*$. An interaction system $(P, Q)$ is a coupling
of two I/O systems, where P is an agent and Q is an environment. Because
the agent and the environment mutually influence each other, their actions
and observations are conditioned by the previous interactions. Accordingly, the
probability of an interaction string $a_1 o_1 \ldots a_T o_T$ is given by

$$\prod_{t=1}^{T} P(a_t \mid a_{<t}, o_{<t})\, Q(o_t \mid a_{\le t}, o_{<t}). \tag{1}$$

From the point of view of the agent P, the distribution $P(a_t \mid a_{<t}, o_{<t})$ is a
policy and captures the probability of producing action $a_t \in \mathcal{A}$ given history
$a_1 o_1 \ldots a_{t-1} o_{t-1}$. The distribution $P(o_t \mid a_{\le t}, o_{<t})$ is the agent's predictive model
of the environment, as it predicts the probability of the observation $o_t \in \mathcal{O}$ given
history $a_1 o_1 \ldots a_{t-1} o_{t-1} a_t$. For the agent P, the sequence $o_1 o_2 \ldots$ provides its
input stream and the sequence $a_1 a_2 \ldots$ is its output stream. In the case of the
environment Q the roles are reversed, that is, the sequence $o_1 o_2 \ldots$ is its output
stream and the sequence $a_1 a_2 \ldots$ provides its input stream. The quintessential
goal is to choose the agent's policy such that the resulting distribution over the
interaction sequences (1) is desirable.

2.2 Policy Construction: Known Environment

If Q is known, then P can be equipped with a model that can perfectly predict
its environment, that is $P(o_t \mid a_{\le t}, o_{<t}) = Q(o_t \mid a_{\le t}, o_{<t})$ for all $a_1 o_1 \ldots a_t \in \mathcal{Z}^*$.
Moreover, a custom-made policy can be designed for P that produces desirable
interaction sequences. Desirability is typically formalized by the economic
theory of subjective expected utility (SEU) flsHloj, which stipulates that a
decision maker's preferences over lotteries can be thought of as maximizing a
SEU of the outcome. In the policy construction setting, this translates into the
designer having a real-valued utility function giving rise to utilities $U(a_{\le T}, o_{\le T})$
for each realization, and the predictive model $P(o_t \mid a_{\le t}, o_{<t})$. The utility
function quantifies the subjective desirability of a particular interaction string and
the probabilities represent the subjective model of the environment. The
maximum expected utility principle then states that the designer has to choose the
policy such that it maximizes the expected utility

$$\mathbf{E}_P[U] = \sum_{a_{\le T},\, o_{\le T}} P(a_{\le T}, o_{\le T})\, U(a_{\le T}, o_{\le T}),$$

where the probabilities $P(a_{\le T}, o_{\le T})$ are the policy-prediction products

$$P(a_{\le T}, o_{\le T}) = \prod_{t=1}^{T} P(a_t \mid a_{<t}, o_{<t})\, P(o_t \mid a_{\le t}, o_{<t}) = \underbrace{\prod_{t=1}^{T} P(a_t \mid a_{<t}, o_{<t})}_{\text{policy}} \; \underbrace{\prod_{t=1}^{T} P(o_t \mid a_{\le t}, o_{<t})}_{\text{prediction}}. \tag{2}$$

The optimal policy is often computed by restating the problem recursively and
then using dynamic programming to solve the Bellman optimality equations
[20]. The policy and the prediction model are both subjective in the sense that
they are unilaterally chosen by the designer. A policy choice explainable by this
scheme is defined to be a rational choice. Choices that do not strictly obey the
maximum SEU principle are irrational or at best bounded rational.



2.3 Policy Construction: Unknown Environment 

In general, the prediction model will not be equal to the generative law of the 
environment, that is, 

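$$P(o_t \mid a_{\le t}, o_{<t}) \neq Q(o_t \mid a_{\le t}, o_{<t}),$$

and consequently the true expected utility, taken under the interaction system
$(P, Q)$, is in general not equal to the SEU:

$$\mathbf{E}_{P,Q}[U] \neq \mathbf{E}_{P}[U]. \tag{3}$$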

One of the most interesting cases where the prediction model and the true 
generative law of the environment do not match is when the designer is uncertain 
about the latter. Formally, the designer expresses his uncertainty by introducing 
a random variable $\theta$ that indexes the class of potential environments $Q_\theta$. More
specifically, he has a class of prediction models and policies

$$B(o_t \mid \theta, a_{\le t}, o_{<t}), \qquad B(a_t \mid \theta, a_{<t}, o_{<t}), \tag{4}$$

such that for every possible environment indexed by $\theta$ there is a perfectly
fitting predictor $B(o_t \mid \theta, a_{\le t}, o_{<t}) = Q_\theta(o_t \mid a_{\le t}, o_{<t})$ and a desirable custom-built
policy $B(a_t \mid \theta, a_{<t}, o_{<t})$. Moreover, the designer believes that $Q_\theta$ is drawn with
probability $B(\theta)$ from a set $\Theta$ of possible environments before the interaction
starts, where $\Theta$ is assumed to be discrete for simplicity.



2.3.1 Decision-theoretic Problem Formulation 

In order to stay within the framework of subjective expected utility one has 
to reduce the problem of the unknown environment to a problem with known 
environment. Such a "known" environment can be created from a set of pos- 
sible environments as a new "super-environment" by marginalizing over the 
parameter of the possible environments, thus obtaining the Bayesian mixture
distribution [22]

$$B(o_{\le T} \mid a_{\le T}) = \sum_{\theta} B(\theta)\, B(o_{\le T} \mid \theta, a_{\le T}) = \sum_{\theta} B(\theta) \prod_{t=1}^{T} B(o_t \mid \theta, a_{\le t}, o_{<t}).$$


The adaptive control problem is then solved by equating the prediction model 
P with the Bayesian predictive distribution over the observations, 

$$P(o_t \mid a_{\le t}, o_{<t}) := B(o_t \mid a_{\le t}, o_{<t}),$$

and then choosing the policy that maximizes the SEU as in the case of a known 
environment. This procedure effectively enlarges the space of possible
environments to the convex hull $H(\Theta)$ spanned by the prediction models in $\Theta$, i.e. an
"environment" is any convex combination of distributions indexed by $\theta$. Each
$\theta \in H(\Theta)$ thus corresponds to a Bayesian mixture over the environments in $\Theta$,
but with a different prior. If the true generative law coincides with one of the
models, i.e. there is a $\theta^* \in \Theta$ such that

$$Q(o_t \mid a_{\le t}, o_{<t}) = B(o_t \mid \theta^*, a_{\le t}, o_{<t}), \tag{5}$$

then the predictive model will converge to the true generative law with
Pr-probability one, that is

$$P(o_t \mid a_{\le t}, o_{<t}) \to Q_{\theta^*}(o_t \mid a_{\le t}, o_{<t})$$

as $t \to \infty$. Since the policy choice is optimal, this scheme directly bypasses the
exploration-exploitation dilemma [23]. The important point, however, is that
in order to express the uncertainty over the environment's probability law, the 
designer had to introduce a belief model that compiles into an actual predic- 
tion model. Thus, the policy construction in the SEU problem statement for 
unknown environments proceeds in two steps: first, a Bayesian mixture environ- 
ment is created, and second a utility-maximizing policy is found on this mixture 
environment. 



2.3.2 Probabilistic Problem Formulation 

An alternative formulation of the problem statement for unknown environments 
is a one-step procedure that essentially stays within (Bayesian) probability
calculus. In this case actions are treated as random variables and our goal is to
determine a distribution $P(a_t \mid \hat{a}_{<t}, o_{<t})$ that tells us how to act depending on
past actions $a_{<t}$ and past observations $o_{<t}$. The distribution $P(a_t \mid \hat{a}_{<t}, o_{<t})$
is the counterpart to the observational distribution $P(o_t \mid \hat{a}_{\le t}, o_{<t})$ used for
prediction. The only caveat is that past probabilistic actions, unlike past observations,
have to be marked as causal interventions, denoted by a hat ($\hat{a}$) in causal calculus [2J].
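Given the models (4), both distributions can then be expressed as mixture
distributions

$$P(o_t \mid \hat{a}_{\le t}, o_{<t}) = \sum_{\theta} B(o_t \mid \theta, a_{\le t}, o_{<t})\, B(\theta \mid \hat{a}_{<t}, o_{<t}),$$

$$P(a_t \mid \hat{a}_{<t}, o_{<t}) = \sum_{\theta} B(a_t \mid \theta, a_{<t}, o_{<t})\, B(\theta \mid \hat{a}_{<t}, o_{<t}),$$

where the posterior is given by

$$B(\theta \mid \hat{a}_{<t}, o_{<t}) = \frac{B(o_{t-1} \mid \theta, a_{\le t-1}, o_{<t-1})\, B(\theta \mid \hat{a}_{<t-1}, o_{<t-1})}{\sum_{\theta'} B(o_{t-1} \mid \theta', a_{\le t-1}, o_{<t-1})\, B(\theta' \mid \hat{a}_{<t-1}, o_{<t-1})}.$$
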



Sampling from $P(a_t \mid \hat{a}_{<t}, o_{<t})$ is equivalent to sampling a random belief $\theta$ from
the posterior $B(\theta \mid \hat{a}_{<t}, o_{<t})$ and then acting according to $B(a_t \mid \theta, a_{<t}, o_{<t})$. This
corresponds to a generalized Thompson sampling procedure, where first a random
belief is sampled and then the optimal policy with respect to this belief
is executed. Effectively, the posterior is also the only place where causal calculus
plays a role, as can be seen by the absence of the likelihood $B(a_{t-1} \mid \theta, a_{<t-1}, o_{<t-1})$,
which is equal to one. Intuitively, the reason for this is that the agent can be 
surprised about his past observations and learn from them, but he cannot be 
surprised about his own actions chosen by himself in the past. Past actions do 
not provide any information about the environment. As will be explained in 
more detail below, causal calculus deals exactly with inference problems when 
some random variables are intervened or set by the inference machine itself. Im- 
portantly, this result is obtained solely by applying basic probability and causal 
calculus. 
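
The sample-then-act procedure described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the construction of [17]: the hypothesis class is a hypothetical two-armed bandit with two candidate environments, each candidate comes with a known deterministic optimal policy, and the names `posterior_update`, `thompson_step` and `run` are invented for the example.

```python
import random

def posterior_update(posterior, likelihoods):
    """One Bayes step: reweight each belief by the likelihood of the new observation."""
    unnorm = [w * l for w, l in zip(posterior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def thompson_step(posterior, policies, rng):
    """Sample a belief index from the posterior, then act with that belief's policy."""
    theta = rng.choices(range(len(posterior)), weights=posterior)[0]
    return policies[theta](rng), theta

def run(true_theta=1, steps=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical environments: reward_prob[theta][action] is the reward probability.
    reward_prob = [[0.8, 0.2], [0.2, 0.8]]
    # Assumed-known optimal policy for each environment: pull the better arm.
    policies = [lambda rng: 0, lambda rng: 1]
    posterior = [0.5, 0.5]
    for _ in range(steps):
        action, _ = thompson_step(posterior, policies, rng)
        obs = 1 if rng.random() < reward_prob[true_theta][action] else 0
        # Only the observation likelihood enters the update; the likelihood of
        # the agent's own action is omitted because the action is an intervention.
        lik = [p[action] if obs else 1 - p[action] for p in reward_prob]
        posterior = posterior_update(posterior, lik)
    return posterior
```

Calling `run()` returns the final posterior over the two candidate environments; with these illustrative reward probabilities the mass concentrates on the true environment after a few dozen interactions.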

3 Policy Uncertainty 

While subjective expected utility is formally appealing as a principle for the 
construction of adaptive agents, its strict application is in practice often prob- 
lematic. This is mainly due to two reasons: 

1. Computational complexity. The computations required to find the opti- 
mal solution (for instance, the computational complexity of solving the 
Bellman optimality equations) are prohibitive in general and scale
exponentially with the length of the horizon: the time complexity of the search
algorithm is $O(|\mathcal{A} \times \mathcal{O}|^T)$. The problem is tractable only in very special
cases under assumptions that reduce the effective size of the problem. 

2. Causal precedence of policy choice. The choice of the policy has to be made 
before the interaction with the environment starts. That is, an agent has 
to have a unique optimal policy before it has even interacted once with 
the environment. An optimal policy constructed by the maximum SEU 
principle is therefore a very risky bet, as a lot of resources have to be spent 
before any evidence exists that the underlying model or prior is adequate. 
For these two reasons, it is often impossible or questionable in practice
to apply the maximum SEU principle. In the following, we investigate how to
weaken the formal assumptions of the policy construction method.

3.1 Policy search 

Given a problem specification in terms of the predictive model and the util- 
ity function, the task of a policy search method is to calculate a policy that 
approximates the optimal policy. 

More specifically, let $\pi$ be a parameter in a set $\Pi$ indexing the set of candidate
policies

$$B(a_t \mid \pi, a_{<t}, o_{<t}) \tag{6}$$

analogous to the prediction models (4) indexed by $\theta \in \Theta$. Then, in the most
general case, a policy search method returns a probability distribution $B(\pi)$ over $\Pi$
representing the uncertainty over the optimal policy parameters. If the
algorithm solves the maximum SEU problem, then the support of this distribution
will exclusively cover the set of optimal policies $\Pi^* \subset \Pi$. Otherwise there
remains uncertainty over the optimal policy parameters.

Policies can also be parameterized in terms of the predictive model. In
particular, we will assume that for each $\theta \in \Theta$ there is a known optimal policy
$\pi \in \Pi$, such that one can construct a function $b : \Theta \to \Pi$ that maps each
$\theta$ into some $\pi \in \Pi$. Uncertainty over the environment can then directly be
translated into policy uncertainty, such that any point in the convex hull $H(\Theta)$
can be mapped to a corresponding point in the convex hull $H(\Pi)$ spanned by
the policies $\pi \in \Pi$.

3.2 The Exploration-Exploitation Trade-Off 

Many policy search methods do not explicitly deal with the uncertainty over
the policy parameters. Some methods only return a point estimate $\pi \in \Pi$. It
is obvious that the greedy usage of the estimate $\pi$ leads to sub-optimal
performance, since for all $\pi$ that are not in the set $\Pi^*$ of optimal policies, one has
that

$$\mathbf{E}_B[U \mid \pi^*] > \mathbf{E}_B[U \mid \pi],$$

where

$$\mathbf{E}_B[U \mid \pi] = \sum_{a_{\le T},\, o_{\le T}} B(a_{\le T}, o_{\le T} \mid \pi)\, U(a_{\le T}, o_{\le T})$$

is the SEU with respect to the policy parameter $\pi \in \Pi$. For instance,
reinforcement learning algorithms [4] start from a randomly initialized point estimate
$\pi_0$ of the optimal policy and then generate refined point estimates $\pi_1, \pi_2, \pi_3, \ldots$
in each time step $t = 1, 2, 3, \ldots$ using the data provided by experience. In
order to converge to the optimal policy, these algorithms have to deal with
the exploration-exploitation trade-off. This means that the agents cannot just
greedily act according to these point estimates; instead, they have to produce
explorative actions as well, that is, actions that deviate from the current
estimate of the optimal policy, for instance by producing optimistic actions based on
UCB.

Let $B_t(\pi)$ denote the posterior distribution over the optimal policy at time
$t$. Then,

$$B_t(\pi) = B_t(b(\theta)) = B_t(b^{-1}(\pi)),$$

where $b^{-1}(\pi) \subset \Theta$ is the pre-image of $\pi$. Hence, this shows that finding the
optimal policy amounts to finding the pre-image of $\pi^*$, such that the distribution
over the policy space becomes the delta function

$$B(\pi) = \begin{cases} 1 & \text{if } \theta^* \in b^{-1}(\pi), \\ 0 & \text{else,} \end{cases}$$

where $\theta^* \in \Theta$ is the true prediction model defined in (5). This highlights the
essence of the exploration-exploitation trade-off: any action issued by the agent
has to respect the uncertainty over the policy parameter; otherwise it is
biased. In particular, if the agent acts greedily (i.e. it treats the estimate $\pi$ as if
it were the true policy parameter) then it is overfitting the experience; likewise, 
an agent having excessive uncertainty is underfitting. From a frequentist point 
of view, this reveals that the exploration-exploitation trade-off is nothing else 
but the bias-variance trade-off [27] in policy space. This suggests that just like
Bayesian modeling naturally balances the bias-variance trade-off by creating
estimators that are probability distributions instead of point estimates, Bayesian
modeling of the exploration-exploitation trade-off leads to a Bayes-causal solu- 
tion for generalized Thompson sampling. 
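
The translation of environment uncertainty into policy uncertainty through the mapping b can be made concrete with a small sketch. It assumes a finite set of environment parameters stored in a dictionary and a hypothetical mapping `b`; the function name `policy_posterior` is invented for the illustration. The posterior weight of a policy is the total posterior mass of its pre-image.

```python
from collections import defaultdict

def policy_posterior(env_posterior, b):
    """Push a posterior over environment parameters forward through b.

    env_posterior maps each theta to its posterior weight; b maps theta to
    the policy that is optimal for it. The weight of a policy pi is the
    summed weight of all theta in the pre-image of pi under b.
    """
    out = defaultdict(float)
    for theta, weight in env_posterior.items():
        out[b(theta)] += weight
    return dict(out)

# Hypothetical example: theta is the believed probability of "heads",
# and the optimal policy simply bets on the more likely side.
env_posterior = {0.2: 0.1, 0.4: 0.2, 0.7: 0.7}
b = lambda theta: "bet_heads" if theta > 0.5 else "bet_tails"
pi_posterior = policy_posterior(env_posterior, b)
```

Here two environment parameters share the same optimal policy, so their posterior weights are added; a delta distribution over policies arises as soon as all remaining posterior mass lies in a single pre-image.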



3.3 Bayes-Causal Solution 

It is important to note that the concept of the bias is conditional on the true 
parameter — which is unknown when the designer is uncertain about the envi- 
ronment. This is not a problem from a Bayesian point of view, because the 
best estimator of the policy parameter is its posterior distribution [is'l . Hence, 
instead of dealing with the exploration-exploitation dilemma by introducing ex- 
plorative actions, one can directly use the posterior distribution over the policy 
parameter as an estimate. 

To see how to do this, note that, by virtue of the mapping $\pi = b(\theta)$, the
policies are independent of the policy parameter when the environment parameter
is known:

$$B(a_t \mid b(\theta), a_{<t}, o_{<t}) = B(a_t \mid \theta, a_{<t}, o_{<t}).$$

Hence, each $\theta \in \Theta$ indexes a dynamical model given by the distributions over
interaction sequences

$$B(a_{\le T}, o_{\le T} \mid \theta) = \prod_{t=1}^{T} B(a_t \mid \theta, a_{<t}, o_{<t})\, B(o_t \mid \theta, a_{\le t}, o_{<t}).$$
Given the dynamical models and their prior probabilities, the designer can form
the Bayesian mixture model

$$B(a_{\le T}, o_{\le T}) = \sum_{\theta} B(\theta)\, B(a_{\le T}, o_{\le T} \mid \theta), \tag{7}$$

where the sum spans all the parameters in $\Theta$. The mixture models in $H(\Theta) \setminus \Theta$
need not be considered here, since it is assumed that $\theta^* \in \Theta$.

Actions as Causal Interventions 

The designer can directly use the probabilistic model (7) to characterize an agent
with policy uncertainty. There is a caveat though when actions are treated as 
random variables. It is clear that the observations produced by the environment 
update the agent's state of knowledge about the environment. However, the 
actions are set by the agent itself and hence they do not provide information 
about the environment. 

The theory that deals with the distinction between exogenous and endoge- 
nous information is statistical causality [29,, ^] . Observations change the infor- 
mation state by regular Bayesian conditioning, whereas actions constitute causal 
interventions followed by Bayesian conditioning. To calculate the effect of an 
intervention, the causal model, i.e. the unique factorization of the joint distri- 
bution into conditional probabilities matching the causal dependencies over the 
random variables, is required to be known. In our setup, this is straightforward: 
first, the environment secretly chooses a true parameter $\theta^* \in \Theta$, and then the
interactions $a_1, o_1, a_2, o_2, \ldots$ follow chronologically.

Formally, this means that the posterior probabilities over the environment 
parameters are given by 

$$B(\theta \mid \hat{a}_{\le t}, o_{\le t})$$

rather than the more familiar expression $B(\theta \mid a_{\le t}, o_{\le t})$, where the "hat"-notation
$\hat{a}_t$ denotes a causal intervention jsij.

For our needs, it is enough to consider the following simple method to cal- 
culate the effect of causal interventions: 

1. Expand the probabilities in terms of the joint distribution. 

2. Rewrite the joint distribution as the causal factorization. 

3. Remove the intervention tags from the intervened random variables that 
are in the probability conditions. 

4. Replace each conditional probability having an intervened variable in the 
argument by a delta function over its chosen value. 

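Applying these four steps to the posterior probabilities over the environment
parameters yields

$$
\begin{aligned}
B(\theta \mid \hat{a}_{\le t}, o_{\le t})
&\overset{(1.)}{=} \frac{B(\theta, \hat{a}_{\le t}, o_{\le t})}{\sum_{\theta'} B(\theta', \hat{a}_{\le t}, o_{\le t})} \\
&\overset{(2.)}{=} \frac{B(\theta) \prod_{k=1}^{t} B(\hat{a}_k \mid \theta, \hat{a}_{<k}, o_{<k})\, B(o_k \mid \theta, \hat{a}_{\le k}, o_{<k})}{\sum_{\theta'} B(\theta') \prod_{k=1}^{t} B(\hat{a}_k \mid \theta', \hat{a}_{<k}, o_{<k})\, B(o_k \mid \theta', \hat{a}_{\le k}, o_{<k})} \\
&\overset{(3.)}{=} \frac{B(\theta) \prod_{k=1}^{t} B(\hat{a}_k \mid \theta, a_{<k}, o_{<k})\, B(o_k \mid \theta, a_{\le k}, o_{<k})}{\sum_{\theta'} B(\theta') \prod_{k=1}^{t} B(\hat{a}_k \mid \theta', a_{<k}, o_{<k})\, B(o_k \mid \theta', a_{\le k}, o_{<k})} \\
&\overset{(4.)}{=} \frac{B(\theta) \prod_{k=1}^{t} B(o_k \mid \theta, a_{\le k}, o_{<k})}{\sum_{\theta'} B(\theta') \prod_{k=1}^{t} B(o_k \mid \theta', a_{\le k}, o_{<k})}.
\end{aligned} \tag{8}
$$
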

This equation shows that beliefs are updated only using the observations, and
that actions are treated "as if they were known beforehand", thus providing no
evidence.
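
The practical difference between treating actions as interventions and conditioning on them as ordinary evidence can be checked on a toy model. The sketch below is an illustration under simplifying assumptions: two hypotheses, memoryless likelihood tables, and observations that are deliberately uninformative. The function `posterior` and the tables `obs_lik` and `act_lik` are hypothetical.

```python
def posterior(prior, history, obs_lik, act_lik=None):
    """Posterior over hypotheses after a history of (action, observation) pairs.

    obs_lik[h][a][o] plays the role of the observation likelihood and
    act_lik[h][a] that of the policy under hypothesis h. With act_lik=None
    the actions are treated as interventions, so their likelihood is dropped
    (replaced by one), as in the final step of the derivation above.
    """
    weights = list(prior)
    for a, o in history:
        for h in range(len(weights)):
            if act_lik is not None:
                # Plain conditioning: the agent's own actions count as evidence.
                weights[h] *= act_lik[h][a]
            weights[h] *= obs_lik[h][a][o]
    z = sum(weights)
    return [w / z for w in weights]

prior = [0.5, 0.5]
# Both hypotheses predict a fair coin regardless of the action taken.
obs_lik = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
# The policies differ: hypothesis 0 prefers action 0, hypothesis 1 prefers action 1.
act_lik = [[0.9, 0.1], [0.1, 0.9]]
history = [(0, 1), (0, 0), (0, 1)]

interventional = posterior(prior, history, obs_lik)
conditional = posterior(prior, history, obs_lik, act_lik)
```

With interventions the posterior stays at the prior, because nothing was learned about the environment. With plain conditioning the agent would become almost certain of hypothesis 0 merely because it happened to choose action 0 three times, which is the self-delusion that the intervention rule prevents.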

Likewise, note that 

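$$B(a_t \mid \theta, \hat{a}_{<t}, o_{<t}) = B(a_t \mid \theta, a_{<t}, o_{<t}). \tag{9}$$

Using (8) and (9), we get the probability of issuing action $a_t \in \mathcal{A}$:

$$B(a_t \mid \hat{a}_{<t}, o_{<t}) = \sum_{\theta} B(a_t \mid \theta, a_{<t}, o_{<t})\, B(\theta \mid \hat{a}_{<t}, o_{<t}). \tag{10}$$

The important fact about (10) is that it was derived only from probability theory
and causal calculus by assuming policy uncertainty over a set of policies. We
can therefore define the policy and prediction models of the agent as

$$
\begin{aligned}
P(a_t \mid \hat{a}_{<t}, o_{<t}) &:= B(a_t \mid \hat{a}_{<t}, o_{<t}), \\
P(o_t \mid \hat{a}_{\le t}, o_{<t}) &:= B(o_t \mid \hat{a}_{\le t}, o_{<t}).
\end{aligned} \tag{11}
$$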

The construction of an adaptive agent with policy uncertainty then proceeds
analogously to a Bayesian inference process.

1. First, we define a set of prediction models $B(o_t \mid \theta, a_{\le t}, o_{<t})$ and policy
models $B(a_t \mid \theta, a_{<t}, o_{<t})$, where each policy model is optimal for a
particular environment $\theta$. In the case of inference the latent random variable
$\theta \in \Theta$ corresponds to the hypothesis.

2. Second, we choose some prior probabilities $B(\theta)$ to model our prior
uncertainty.

3. Third, we use the distribution $B(a_t \mid \hat{a}_{<t}, o_{<t})$ as the agent's adaptive policy
$P(a_t \mid \hat{a}_{<t}, o_{<t})$, and the distribution $B(o_t \mid \hat{a}_{\le t}, o_{<t})$ as the agent's adaptive
predictor $P(o_t \mid \hat{a}_{\le t}, o_{<t})$.

Thus, Thompson sampling is used in every time step to sample an action $a_t$
from the predictive distribution $B(a_t \mid \hat{a}_{<t}, o_{<t})$.
4 Convergence & Co-Adaptation

In [17], the limit behavior of a Thompson sampling agent (11) was investigated.
Assuming that there exists a belief $P(o_t \mid \theta^*, a_{\le t}, o_{<t})$ that perfectly models the
environment $Q_{\theta^*}$ such that $P(o_t \mid \theta^*, a_{\le t}, o_{<t}) = Q_{\theta^*}(o_t \mid a_{\le t}, o_{<t})$, then the agent
(11) converges in the sense that $P(a_t \mid \hat{a}_{<t}, o_{<t}) \to P(a_t \mid \theta^*, a_{<t}, o_{<t})$ almost surely
as $t \to \infty$ if the interaction system $(P, Q)$ fulfills certain ergodicity requirements
and all policies $P(a_t \mid \theta, a_{<t}, o_{<t})$ are consistent. Roughly speaking, the first
requirement ensures that the agent can recover from any initial mistakes, and
the second requirement ensures that all predictors $B(o_t \mid \theta, a_{\le t}, o_{<t})$ that make
the same predictions for the tail of the observation sequence are coupled to the
same policy $B(a_t \mid \theta, a_{<t}, o_{<t})$. Thus, the same beliefs imply the same behaviors.

But what happens if the environment is also adaptive? As long as the 
agent has a model $P(o_t \mid \theta, a_{\le t}, o_{<t})$ that captures the adaptive behavior of the
environment nothing fundamentally changes. However, the agent might not 
have a model about the adaptive behavior of the environment, while still having 
a pretty good idea about the environment's preferences. This is typically the 



case in game theory [33, |33|, where the agent knows the other agent's best 
response function, but has no model of the other agent's adaptive behavior. In 
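the simplest one-shot simultaneous move games the best response functions are
given by

$$\mathrm{BR}_1[Q(o)] = \arg\max_{P(a)} \sum_{a,o} P(a)\, Q(o)\, U(a, o)$$

$$\mathrm{BR}_2[P(a)] = \arg\max_{Q(o)} \sum_{a,o} P(a)\, Q(o)\, V(a, o),$$

where player 1's best response $\mathrm{BR}_1$ is a distribution $P(a)$ over actions $a$ that
depends on agent 2's probability $Q(o)$ of emitting $o$, and where $U$ and $V$
are the payoff functions for player 1 and 2 respectively. A Nash equilibrium
$(P^*(a), Q^*(o))$ is a fixed point of these coupled equations [34], where each
individual player has no incentive to change his distribution, that is

$$\sum_{a,o} P^*(a)\, Q^*(o)\, U(a, o) \ge \sum_{a,o} P(a)\, Q^*(o)\, U(a, o)$$

$$\sum_{a,o} P^*(a)\, Q^*(o)\, V(a, o) \ge \sum_{a,o} P^*(a)\, Q(o)\, V(a, o)$$

for all distributions $P(a)$ and $Q(o)$.
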
How such equilibria are reached is not a subject of classical equilibrium game theory.
But if such games are repeated over and over again, evolutionary game theory
suggests that these equilibria appear as fixed points of adaptation dynamics like
the replicator equations [35]. As Bayesian inference can also be viewed as some
kind of replicator dynamics [36, 37], this provides an interesting starting point to
study the emergence of Nash equilibria when two adaptive Thompson sampling
agents interact.

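When both agents are adaptive according to (11), we can decompose the
policies $P(a)$ and $Q(o)$ into mixture distributions, such that

$$P(a) = \sum_{\theta} P(a \mid \theta)\, P(\theta)$$

$$Q(o) = \sum_{\xi} Q(o \mid \xi)\, Q(\xi),$$

where $P(a \mid \theta) = \mathrm{BR}_1[Q(o)]$ and $Q(o \mid \xi) = \mathrm{BR}_2[P(a)]$ and $P(\theta)$ and $Q(\xi)$ are the
prior distributions over the possible behaviors. Moreover, both agents will have
predictive models over the other agent's behavior, such that

$$P(o) = \sum_{\theta} P(o \mid \theta)\, P(\theta)$$

$$Q(a) = \sum_{\xi} Q(a \mid \xi)\, Q(\xi),$$

where $P(o \mid \theta) = Q(o \mid \xi)$ and $Q(a \mid \xi) = P(a \mid \theta)$. In the following we will assume
that there exists at least one pair $(\theta^*, \xi^*)$ where the model $P(o \mid \theta^*)$ perfectly
matches $Q(o \mid \xi^*)$ such that $P(o \mid \theta^*) = Q(o \mid \xi^*)$ and at the same time the model
$Q(a \mid \xi^*)$ perfectly matches $P(a \mid \theta^*)$ such that $P(a \mid \theta^*) = Q(a \mid \xi^*)$. In this case
both agents can predict the other agent's behavior, which means that there
will be no drive to change the posteriors $P(\theta \mid D)$ and $Q(\xi \mid D)$, given some past
experience $D$. Then, both agents should "lock in" when their posteriors are
sufficiently close to $\delta_{\theta^*}$ and $\delta_{\xi^*}$. Formally, we define a pair $(\xi^*, \theta^*)$ to be a
strict Nash equilibrium if



This corresponds to a lock-in of the predictive models that drive the adaptation
process.

In order to study convergence to $(\theta^*, \xi^*)$, we determine the difference in
relative entropy between the predictive distribution $P(o)$ at time $t$ and the
generative distribution $Q(o \mid \xi^*)$ and the relative entropy between the predictive
distribution $P(o \mid D)$ at time $t+1$ and the generative distribution $Q(o \mid \xi^*)$, after
observing $D$ at time step $t$. This difference can serve as a Lyapunov function to
show convergence, if we require that

$$\Delta KL = \sum_{D} Q(D \mid \xi^*)\, D_{KL}\big(Q(o \mid \xi^*) \,\|\, P(o \mid D)\big) - D_{KL}\big(Q(o \mid \xi^*) \,\|\, P(o)\big) < 0.$$

Given the predictive distribution $P(o) = \sum_{\theta} P(o \mid \theta) P(\theta)$ at time $t$ and the
predictive distribution $P(o \mid D) = \sum_{\theta} P(o \mid \theta) P(\theta \mid D)$ at time $t + 1$, we get

$$
\begin{aligned}
\Delta KL
&= \sum_{D} Q(D \mid \xi^*) \sum_{o} Q(o \mid \xi^*) \left[ \log \frac{Q(o \mid \xi^*)}{\sum_{\theta} P(o \mid \theta) P(\theta \mid D)} - \log \frac{Q(o \mid \xi^*)}{\sum_{\theta} P(o \mid \theta) P(\theta)} \right] \\
&\le \sum_{D} Q(D \mid \xi^*) \sum_{o} Q(o \mid \xi^*) \log \frac{\sum_{\theta} P(o \mid \theta) P(\theta)}{P(o \mid \theta^*)\, P(\theta^* \mid D)} \\
&= \underbrace{\sum_{o} Q(o \mid \xi^*) \log \frac{\sum_{\theta} P(o \mid \theta) P(\theta)}{P(o \mid \theta^*)}}_{< 0 \text{ (Nash property)}}
 + \underbrace{\sum_{D} Q(D \mid \xi^*) \log \frac{P(D)}{P(D \mid \theta^*)}}_{< 0 \text{ (Nash property)}}
 - \log P(\theta^*).
\end{aligned}
$$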



If the prior weight $P(\theta^*)$ is close to one then the last positive term $-\log P(\theta^*)$ is
close to zero and the two other terms will dominate, making the whole expression
negative, thus implying convergence. The argument can also be extended to
the case where $D$ is generated by $\sum_{\xi} Q(D \mid \xi) Q(\xi)$ instead of $Q(D \mid \xi^*)$, depending
again on the weight of $Q(\xi^*)$ compared to the other weights $Q(\xi)$, i.e. how close
the other agent is to the Nash policy. The same argument is then repeated for
player 2 who uses the prediction model $Q(a \mid \xi)$ to model $P(a \mid \theta)$.

Consider as an example the matching pennies game [33], where each player
has a penny and must decide whether to secretly turn their penny to heads
or tails. Player 1 wins if both pennies show heads or both show tails, whereas 
player 2 wins if the pennies are unmatched. The payoff matrices of the game 
are 

$$\begin{pmatrix} (+1, -1) & (-1, +1) \\ (-1, +1) & (+1, -1) \end{pmatrix}$$

where the first number in each cell is the payoff for player 1 and the second 
number is the payoff for player 2, and the columns of the matrix correspond to 
the choice of heads or tails of player 1, and the rows of the matrix correspond 
to the choice of heads or tails of player 2. The best responses are then 



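$$P(a = H \mid \theta) = \begin{cases} 0 & \text{if } \theta < 1/2 \\ 1/2 & \text{if } \theta = 1/2 \\ 1 & \text{if } \theta > 1/2 \end{cases}$$

and

$$Q(o = H \mid \xi) = \begin{cases} 1 & \text{if } \xi < 1/2 \\ 1/2 & \text{if } \xi = 1/2 \\ 0 & \text{if } \xi > 1/2 \end{cases}$$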
[Figure 1 panels: "Evolution of posterior" (left) and "Evolution of average reward" (right); horizontal axes run to 3000.]

Figure 1: Matching pennies game with two Thompson sampling agents. The
left figure shows the evolution of the mean of the posterior distributions over $\theta$
and $\xi$ for a single run. The right figure shows the average mean reward obtained
by each player.



and the predictive models are $P(o = H \mid \theta) = \theta$ and $Q(a = H \mid \xi) = \xi$. For the priors
we assume the uniform distributions $P(\theta) = 1$ and $Q(\xi) = 1$, such that the
posteriors are



and 



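$$P(\theta; a_1, a_2) = \frac{\theta^{a_1 - 1} (1 - \theta)^{a_2 - 1}}{B(a_1, a_2)}$$

and

$$Q(\xi; b_1, b_2) = \frac{\xi^{b_1 - 1} (1 - \xi)^{b_2 - 1}}{B(b_1, b_2)},$$
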

where $B(\cdot, \cdot)$ is the beta function, and $a_1$ and $a_2$ are the numbers of heads and
tails played by player 2, whereas $b_1$ and $b_2$ are the numbers of heads and tails
played by player 1. In this case the only $(\theta^*, \xi^*)$ pair that is a Nash equilibrium
is in the 50 : 50 case, because only then do action and prediction model fit for
both agents. In each time step the agents sample $\theta$ and $\xi$ respectively from their
posteriors $P(\theta \mid \cdot)$ and $Q(\xi \mid \cdot)$ and act with their best response to this sample. In
Figure 1 it can be seen how they co-adapt and converge to the Nash equilibrium.
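
The co-adaptation experiment can be approximated with a short simulation. This is a sketch under assumptions and not the code used to produce Figure 1: both players keep Beta posteriors that start from the uniform prior, each samples a belief about the opponent's probability of playing heads, and each plays the deterministic best response to that sample. The function name `simulate` and the variable names are hypothetical.

```python
import random

def simulate(steps=3000, seed=0):
    """Two Thompson sampling agents playing repeated matching pennies.

    Returns the posterior means of theta (player 1's belief that player 2
    plays heads) and xi (player 2's belief that player 1 plays heads).
    """
    rng = random.Random(seed)
    # Beta parameters [heads + 1, tails + 1]; [1, 1] is the uniform prior.
    seen_by_p1 = [1, 1]  # player 1's record of player 2's moves
    seen_by_p2 = [1, 1]  # player 2's record of player 1's moves
    for _ in range(steps):
        theta = rng.betavariate(seen_by_p1[0], seen_by_p1[1])
        xi = rng.betavariate(seen_by_p2[0], seen_by_p2[1])
        # Best responses to the sampled beliefs; a sample of exactly 1/2
        # has probability zero, so the tie case is ignored here.
        a_heads = theta > 0.5  # player 1 wants to match
        o_heads = xi < 0.5     # player 2 wants to mismatch
        seen_by_p1[0 if o_heads else 1] += 1
        seen_by_p2[0 if a_heads else 1] += 1
    mean_theta = seen_by_p1[0] / sum(seen_by_p1)
    mean_xi = seen_by_p2[0] / sum(seen_by_p2)
    return mean_theta, mean_xi
```

Because each player's sampled best response pushes the opponent's posterior back toward 1/2, the posterior means are expected to oscillate around the 50 : 50 equilibrium with shrinking amplitude, which is the qualitative behavior described for the left panel of the figure.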



5 Causal induction 

Agent (11) can be thought of as a probabilistic superposition of models $\theta$, where
each model $\theta$ is characterized by a likelihood model $P(o_t \mid \theta, a_{\le t}, o_{<t})$ and a
policy model $P(a_t \mid \theta, a_{<t}, o_{<t})$. In previous applications we assumed that all models $\theta$
have the same causal structure, i.e. considering multivariate random variables
$a_t$ and $o_t$, we assumed that the same variables $a_t$ are intervened for all $\theta$ and
the same causal model is used to predict the consequences of these
interventions on the observational variables $o_t$. However, this need not be the case.
In principle, different models $\theta$ could represent different causal structures and
suggest intervention of different variables. Such a setup can be used for causal
induction.

Imagine, for example, we are given a device with two light bulbs, one green ($X$) and one red ($Y$), whose states obey a hidden mechanism that correlates them positively. Moreover, the device has switches that allow us to control the state of either bulb. We encode the "on" and "off" states of the green light as $X = x$ and $X = \neg x$ respectively. Analogously, $Y = y$ and $Y = \neg y$ denote the "on" and "off" states of the red light. We are interested in the explanatory power of two competing hypotheses: either "green causes red" ($\Theta = \theta$) or "red causes green" ($\Theta = \neg\theta$).

One of the main methods to deal with problems of causal inference is the 
framework of causal graphical models [2^. Given a graph that represents a 
causal structure, we can intervene this graph and ask questions about the prob- 
abilities of the variables in the graph. However, in causal induction we would 
like to discover the causal structure itself, that is, we would like to do inference over a multitude of graphs representing different causal structures [38]. If one would like to represent the problem of causal discovery graphically, the main challenge is that the model $\Theta$ is a random variable that controls the causal structure itself. That is, a tentative graphical representation would be



[Diagram: $\Theta$ placed on a meta-level above the graphical model over $X$ and $Y$.]



which cannot be analyzed using the mathematical framework of graphical models alone because the random variable $\Theta$ operates on a meta-level of the graphical model over $X$ and $Y$. In fact, different causal structures have to be investigated by different graphical models, that is, the inference process over different causal structures cannot be represented in one and the same graphical model. However, this difficulty can be overcome by using a probability tree to model the causal structure over the random events [39]. Probability trees can encode alternative causal realizations, and in particular alternative causal hypotheses [40]. All random variables are then of the same type, so no distinctions between meta-levels are needed.



Representation 

We can use probability trees to represent the prediction model that the agent 
has about its environment. An exemplary probability tree for our problem is 
depicted in Figure 2. In this tree, each (internal) node is interpreted as a causal
mechanism; hence a path from the root node to one of the leaves corresponds to 
a particular sequential realization of causal mechanisms. The logic underlying 
the structure of this tree is as follows: 

1. Causal precedence: A node causally precedes its descendants. For instance, the root node corresponding to the sure event causally precedes all other nodes.

2. Resolution of variables: Each node resolves the value of a random variable. For instance, given the node corresponding to $\Theta = \theta$ and $X = \neg x$, either $Y = y$ will happen with probability $\Pr(y \mid \theta, \neg x) = \frac{1}{4}$ or $Y = \neg y$ with probability $\Pr(\neg y \mid \theta, \neg x) = \frac{3}{4}$.

3. Heterogeneous order: The resolution order of random variables can vary across different branches. For instance, $X$ precedes $Y$ under $\Theta = \theta$, but $Y$ precedes $X$ under $\Theta = \neg\theta$. This allows modeling different causal hypotheses.

Figure 2: a) An exemplary probability tree to represent the agent's prediction model about its environment. The probability of a realization (written under the leaves) is calculated by multiplying the probabilities starting from the root until a leaf is reached. Note that the two hypotheses are statistically indistinguishable. b) The probability tree resulting from (a) after setting $X = x$.

While the probability tree represents our subjective model explaining the order in which the random values are resolved, it does not necessarily correspond to the temporal order in which the events are revealed to us. So for instance, under hypothesis $\Theta = \theta$, the value of the variable $Y$ might be revealed before $X$, even though $X$ causally precedes $Y$; and the hypothesis $\Theta$, which precedes both $X$ and $Y$, is never observed.



Interventions 

The importance of interventions to detect causal structure is illustrated in Figure 2a, as the observational probabilities are completely symmetric for the two halves of the tree. Suppose we observe that both lights are on. Have we learned anything about their causal dependency? A brief calculation shows that this is not the case because the posterior probabilities are equal to the prior probabilities:

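\[
\Pr(\theta \mid x, y)
= \frac{\Pr(y \mid \theta, x)\Pr(x \mid \theta)\Pr(\theta)}
       {\Pr(y \mid \theta, x)\Pr(x \mid \theta)\Pr(\theta) + \Pr(x \mid \neg\theta, y)\Pr(y \mid \neg\theta)\Pr(\neg\theta)}
= \frac{\frac{3}{4} \cdot \frac{1}{2} \cdot \frac{1}{2}}
       {\frac{3}{4} \cdot \frac{1}{2} \cdot \frac{1}{2} + \frac{3}{4} \cdot \frac{1}{2} \cdot \frac{1}{2}}
= \frac{1}{2} = \Pr(\theta).
\]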
This makes sense intuitively, because by just observing that the two lights are on, it is statistically impossible to tell which one caused the other. The only way to extract causal information is then to intervene, paraphrased as "no causes in, no causes out" [41] or "to find out what happens when you kick the system, you have to kick the system" [42]. Thus, we now repeat our experiment, but this time we turn on the green light ($X = x$). We reflect this choice by changing all the mechanisms that resolve the random variable $X$, placing all the probability mass on the outcome $X = x$ (see Figure 2b). Assume that we subsequently observe that the second light is on. Then, the posterior probabilities are

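\[
\Pr(\theta \mid \hat{x}, y)
= \frac{\Pr(y \mid \theta, \hat{x})\Pr(\hat{x} \mid \theta)\Pr(\theta)}
       {\Pr(y \mid \theta, \hat{x})\Pr(\hat{x} \mid \theta)\Pr(\theta) + \Pr(\hat{x} \mid \neg\theta, y)\Pr(y \mid \neg\theta)\Pr(\neg\theta)}
= \frac{\frac{3}{4} \cdot 1 \cdot \frac{1}{2}}
       {\frac{3}{4} \cdot 1 \cdot \frac{1}{2} + 1 \cdot \frac{1}{2} \cdot \frac{1}{2}}
= \frac{3}{5},
\]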



where $\hat{x}$ is Pearl's notation to indicate a causal intervention on $X$. Since $P(\theta) < P(\theta \mid \hat{x}, y)$, we have gathered evidence favoring the hypothesis "green causes red". This was only possible because our intervention introduced a statistical asymmetry between the two hypotheses that did not exist before.
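The two posterior calculations can be checked with exact arithmetic. The sketch below hard-codes the tree probabilities used in the text (prior 1/2, first bulb on with probability 1/2, second bulb copying the first with probability 3/4); the constant and function names are illustrative assumptions.

```python
from fractions import Fraction as F

PRIOR = F(1, 2)              # Pr(theta) = Pr(not theta)
P_FIRST_ON = F(1, 2)         # the causally first bulb is on
P_SECOND_ON_IF_ON = F(3, 4)  # the second bulb is on given the first is on


def posterior_theta(intervene_on_x):
    """Pr(theta | X = x, Y = y), with X either observed or set by intervention."""
    # Under theta (X precedes Y): Pr(y | theta, x) * Pr(x | theta).
    p_x_theta = F(1) if intervene_on_x else P_FIRST_ON
    like_theta = P_SECOND_ON_IF_ON * p_x_theta
    # Under not-theta (Y precedes X): Pr(x | not theta, y) * Pr(y | not theta).
    p_x_not_theta = F(1) if intervene_on_x else P_SECOND_ON_IF_ON
    like_not_theta = p_x_not_theta * P_FIRST_ON
    joint_theta = like_theta * PRIOR
    return joint_theta / (joint_theta + like_not_theta * (1 - PRIOR))


print(posterior_theta(False))  # observation only
print(posterior_theta(True))   # after intervening on X
```

Pure observation leaves the posterior at the prior, whereas the intervention shifts it toward "green causes red".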



Thompson Sampling 

Naturally, multiple interventions and observations can be executed in succession. In this case Thompson sampling is used in each time step to decide which policy model to use, which in turn determines which variables are intervened on. Then, after the intervention, all variables are revealed simultaneously at every time step of the inference process. The update of the observational probabilities is done in the same way as in the one-step case, taking into account which variables were intervened on. A simulation of the repeated Thompson sampling process for causal induction in our example system is shown in Figure 3.
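A minimal sketch of such a repeated process is given below. The text does not spell out the policy models, so the sketch assumes (as a labeled simplification) that a sampled hypothesis "green causes red" switches the green bulb on and a sampled "red causes green" switches the red bulb on; the function name, step count and seed are likewise hypothetical.

```python
import random

P_ON = 0.5      # probability that the causally first bulb is on
P_MATCH = 0.75  # probability that the second bulb is on given the first is on


def run_causal_thompson(true_theta, steps=1000, seed=0):
    """Return the final posterior of 'green causes red' (theta)."""
    rng = random.Random(seed)
    p_theta = 0.5
    for _ in range(steps):
        # Thompson sampling: instantiate one hypothesis from the posterior.
        sampled_theta = rng.random() < p_theta
        if sampled_theta:
            # Assumed policy: set X = x, then observe Y from the true system.
            y_on = rng.random() < (P_MATCH if true_theta else P_ON)
            like_theta = P_MATCH if y_on else 1 - P_MATCH
            like_not_theta = P_ON if y_on else 1 - P_ON
        else:
            # Assumed policy: set Y = y, then observe X from the true system.
            x_on = rng.random() < (P_ON if true_theta else P_MATCH)
            like_theta = P_ON if x_on else 1 - P_ON
            like_not_theta = P_MATCH if x_on else 1 - P_MATCH
        # Bayesian update; the intervened variable contributes no likelihood term.
        joint_theta = like_theta * p_theta
        p_theta = joint_theta / (joint_theta + like_not_theta * (1 - p_theta))
    return p_theta


print(run_causal_thompson(True), run_causal_thompson(False))
```

Because either intervention breaks the symmetry between the two hypotheses, the posterior is expected to concentrate on the true causal structure in both cases.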



6 Discussion 

Equations (11) were first derived in [11] as the optimal solution to the adaptive coding problem given actions and observations, as a Bayesian rule for control. In practice, it is implemented by sampling an environment parameter $\theta_t$ for each time step from the posterior distribution $B(\theta \mid \hat{a}_{<t}, o_{<t})$, and then treating it as if it were the true parameter, that is, issuing the action $a_t$ from $B(a_t \mid \theta_t, a_{<t}, o_{<t})$. This action-sampling method where beliefs are randomly instantiated was first proposed as a heuristic in [1] and is now known as Thompson sampling. Equations (11) therefore provide a method for generalized Thompson sampling applicable to adaptive sequential decision-making problems.

Figure 3: Thompson sampling for causal induction. (Left) Posterior distribution $P(\theta \mid \cdot)$ for 10 runs when the true system is given by $\Theta = \theta$. (Right) Posterior distribution $P(\theta \mid \cdot)$ for 10 runs when the true system is given by $\Theta = \neg\theta$. In both cases the agent is able to identify the causal structure of the environment with high confidence when $P(\theta \mid \cdots)$ is close to one or zero respectively.

The main contribution of this paper is to examine three features of such generalized Thompson sampling. First, we provide an argument showing that Thompson sampling is a natural consequence of a Bayesian treatment of policy uncertainty. Policy uncertainty arises whenever an agent is trying to find an optimal policy but is unable to do so, for example due to computational constraints, even though the agent might have a coarse idea about the optimum, which can be expressed as a probability distribution. The Bayesian treatment of this uncertainty is analogous to Bayesian estimation in the case of pure observation problems. The Bayes-optimal estimator in this case is not a point estimate but a distribution, which forgoes the bias-variance dilemma. Similarly, in the case of actions, the exploration-exploitation trade-off can be circumvented by Thompson sampling from probabilistic policies expressed as Bayesian mixture distributions.

Second, we investigated co-adaptation of two adaptive Thompson sampling agents. We could demonstrate that such agents converge to Nash equilibria if the parameterized policy set they are choosing from is given by the parameterized best response functions. This approach also generalizes previous models of fictitious play [43], [44] that best-respond to the observed frequency of the opponent's play rather than best-responding to their randomized beliefs about the opponent. Therefore, adaptive Thompson sampling agents might provide a useful modeling tool for evolutionary game theory [35] and learning in games in general [45].

Third, we could demonstrate that generalized Thompson sampling can also be applied to the problem of causal induction, by designing policy and prediction models with different causal structures. This way generalized Thompson sampling can be used as a general method for causal induction that is Bayesian in nature. It is based on the idea of combining probability trees [40] with interventions [2^ for predicting the behavior of a manipulated system with multiple causal hypotheses. Both the interventions and the constraints on the causal hypotheses introduce statistical asymmetries that permit the extraction of causal information. Unlike frameworks that aim to extract causal information from observational data alone [46], [47], [48], the proposed method is designed for agents that interact with their environment and use these interactions to discover causal relationships.

So far Thompson sampling has been mainly applied to multi-armed bandit problems. Multi-armed bandits can be represented by a parameter $\theta$ that summarizes the statistical properties of the reward obtained for each lever. Reward
distributions range from Bernoulli to Gaussian (with unknown mean and vari- 
ance), and they can also depend on the particular context or state [ol. [ill. [sl [loj . 
In particular, the work of [11] and the work of [1] prove asymptotic convergence of Thompson sampling. The work of [12] presents empirical results that show Thompson sampling is highly competitive, matching or outperforming popular methods such as UCB [25], [26].
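For the simplest of these settings, a Bernoulli bandit with independent uniform priors on each lever, Thompson sampling reduces to a few lines. The sketch below is illustrative only; the function name, the arm means and the horizon are assumptions and do not reproduce any of the cited experiments.

```python
import random


def thompson_bernoulli_bandit(true_means, steps=2000, seed=0):
    """Return how often each lever was pulled (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(true_means)
    wins = [0] * n
    losses = [0] * n
    for _ in range(steps):
        # Sample one reward probability per lever from its Beta posterior.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
        # Act optimally with respect to the sampled parameters.
        arm = max(range(n), key=lambda i: samples[i])
        if rng.random() < true_means[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[i] + losses[i] for i in range(n)]


pulls = thompson_bernoulli_bandit([0.2, 0.8])
print(pulls)
```

With a large gap between the levers, most pulls should go to the better lever, while the inferior one is still tried occasionally early on.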



Another class of problems to which Thompson sampling has been applied in the past is Markov decision processes (MDPs). MDPs can be described by parameterizing the dynamics and reward distribution (model-based) [49] or by directly parameterizing the Q-table (model-free) [50], [31]. The first approach samples a full description of an MDP, solves it for the optimal policy, and then issues the optimal action. This is repeated in each time step. The second approach avoids the computational overhead of solving for the optimal policy in each time step by directly doing inference on the Q-tables. Actions are chosen by picking the one with the highest Q-value for the current state. The same ideas can also be applied to solve adaptive control problems with linear system equations, quadratic cost functions and Gaussian noise [51].



6.1 Optimality 

One of the main arguments of this paper is that the derivation presented in Section 3.3 shows that generalized Thompson sampling is not just a heuristic method, but that it can be derived under the assumption of policy uncertainty, simply by applying probability theory and causal calculus. This Thompson sampling approach differs from the formulation of adaptive control problems as control problems with known environments that require the maximization of a subjective expected utility criterion (compare Section 2.3). The difference between the two approaches can be highlighted by contrasting the two one-step decision scenarios depicted in Figure 4. The goal is to predict the outcome of a biased coin with payoffs $1 and $0 for a right and wrong guess respectively. A rational decision maker places bets (shown inside speech bubbles) such that his subjective expected utility is maximized. These subjective beliefs are delimited within dotted boxes.

Figure 4: Comparison between (a) a fully rational decision maker and (b) a decision maker with random beliefs.

The difference between the two becomes clear by inspecting the expected utility in each case: they are

\[
\max_a \sum_\theta P(\theta) \Bigl\{ \sum_o P(o \mid \theta, a) U(o) \Bigr\} \tag{a}
\]

and

\[
\sum_\theta P(\theta) \max_a \Bigl\{ \sum_o P(o \mid \theta, a) U(o) \Bigr\} \tag{b}
\]

respectively. Here it is clearly seen that the difference between the two lies in the order in which we apply the expectation (over the environment parameter) and the maximization operator. Both cases can be explained in terms of optimality. However, in (a) the decision-maker picks his action taking into account the uncertainty over the bias, while in (b) the decision-maker picks his action only after his beliefs over the coin bias are instantiated, that is, he is optimal w.r.t. his random beliefs.
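The effect of swapping the two operators can be made concrete with a small numerical sketch. The two coin biases and the uniform prior below are hypothetical values chosen for illustration, and the payoff follows the coin example (1 for a right guess, 0 for a wrong one).

```python
from fractions import Fraction as F

# Hypothetical belief: the coin shows heads with probability 1/4 or 3/4.
PRIOR = {F(1, 4): F(1, 2), F(3, 4): F(1, 2)}
ACTIONS = ("heads", "tails")


def expected_payoff(action, bias):
    """Expected payoff of a guess when heads has probability `bias`."""
    return bias if action == "heads" else 1 - bias


def value_max_outside():
    """Case (a): maximize over actions after averaging over the bias."""
    return max(
        sum(p * expected_payoff(a, bias) for bias, p in PRIOR.items())
        for a in ACTIONS
    )


def value_max_inside():
    """Case (b): average over the bias of the best action for each bias."""
    return sum(
        p * max(expected_payoff(a, bias) for a in ACTIONS)
        for bias, p in PRIOR.items()
    )


print(value_max_outside(), value_max_inside())
```

The second quantity is evaluated with respect to the instantiated belief, so it can never be smaller than the first; whether it is attained in the actual environment depends on how well the sampled belief matches the true bias.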

The distinction between probabilities that one takes into account when making a decision and probabilities that one does not (i.e. they are immeasurable) was first proposed by [52]. The classical decision theories of (Tl | and [igj only consider known probabilities that are reasoned about inside the max-operation. Another example where random beliefs play a crucial role is in games with incomplete information [53]. Here, having incomplete information about the other player leads to an infinite hierarchy of meta-reasoning about the other player's strategy. To avoid this difficulty, Harsanyi introduced Bayesian games [53]. In a Bayesian game, incomplete knowledge is modeled by randomly instantiating the players' types, after which they choose their strategies optimally, thus eliminating the need for recursive reasoning about the other players' strategy.

Maintaining and updating Bayesian probabilities is an optimally efficient way to deal with uncertainty, be it with respect to the policy or the environment [31]. Therefore, the central claim is that having random beliefs, as formalized by generalized Thompson sampling, can be considered optimal under the constraint of having policy uncertainty, an uncertainty that is inevitable whenever we are unable to compute the optimal policy. Having policy uncertainty effectively weakens the two assumptions of the maximum expected utility principle: the optimal policy can be chosen and refined during interactions, and the computational complexity is lower.

The operational distinction of having policy uncertainty has important al- 
gorithmic consequences. When there is policy uncertainty, the belief of the 
decision-maker is itself a random variable. This means that the very policy 
is undefined until the random variable is resolved. Hence, the computation of 
the optimal policy can be delayed and determined dynamically. It is precisely 
this fact that is (implicitly) exploited in popular reinforcement learning algo- 
rithms, and explicitly in the algorithms based on random beliefs. This is in 
stark contrast to the case when there is no policy uncertainty, where the policy 
is pre-computed and static. 



Adaptive Coding and the Kullback-Leibler Divergence 

Even though the maximization is inside the expectation in the case of random belief approaches to decision-making, there is another outer maximization or optimality criterion implicit, analogous to the case of Bayesian inference, which is known to optimize Kullback-Leibler divergences. Therefore, it is useful to think about the adaptive control problem as an adaptive coding and inference problem [17], [54]. In terms of the initial problem statement in Section 2, the question then is: How can the designer construct a system $P$ defined by $P(o_t \mid a_{\le t}, o_{<t})$ and $P(a_t \mid a_{<t}, o_{<t})$ such that its behavior is as close as possible to the custom-made system $B(o_t \mid \theta, a_{\le t}, o_{<t})$ and $B(a_t \mid \theta, a_{<t}, o_{<t})$ under any realization of $\Theta$? Using the Kullback-Leibler divergence as a distance measure, we can formulate a variational problem in $P$

\[
P := \arg\min_{\Pr} \Bigl\{ \limsup_{t \to \infty} \sum_\theta P(\theta) \sum_{\tau=1}^{t} \bigl( D_\theta^{a_\tau}(\Pr) + D_\theta^{o_\tau}(\Pr) \bigr) \Bigr\} \tag{12}
\]

with

\[
D_\theta^{a_t}(\Pr) = \sum_{a_{<t}, o_{<t}} B(a_{<t}, o_{<t} \mid \theta) \sum_{a_t} B(a_t \mid \theta, a_{<t}, o_{<t}) \log \frac{B(a_t \mid \theta, a_{<t}, o_{<t})}{\Pr(a_t \mid a_{<t}, o_{<t})}
\]

\[
D_\theta^{o_t}(\Pr) = \sum_{a_{\le t}, o_{<t}} B(a_{\le t}, o_{<t} \mid \theta) \sum_{o_t} B(o_t \mid \theta, a_{\le t}, o_{<t}) \log \frac{B(o_t \mid \theta, a_{\le t}, o_{<t})}{\Pr(o_t \mid a_{\le t}, o_{<t})}
\]



In the case of observations, this is a well-known variational principle for Bayesian 
inference, as it describes a predictor that requires, on average, the least amount 
of extra bits to capture informational surprise stemming from the behavior of 
the environment. In the case of actions, the same principle can be harnessed 
to describe resourceful generation of actions in a way that requires random bits 
with minimum length on average, when trying to match the optimal policy most 
suitable for the unknown environment [55].
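For a single prediction step, the variational principle for observations can be illustrated numerically: among all predictors, the Bayesian mixture minimizes the prior-weighted Kullback-Leibler divergence to the individual models. The sketch below uses two hypothetical models over a binary outcome and is not a treatment of the full sequential problem; all names and numbers are assumptions.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_kl(prior, models, predictor):
    """Prior-weighted divergence of the predictor from each model."""
    return sum(w * kl(m, predictor) for w, m in zip(prior, models))


def bayes_mixture(prior, models):
    """Mixture of the models weighted by their prior probabilities."""
    size = len(models[0])
    return [sum(w * m[i] for w, m in zip(prior, models)) for i in range(size)]


prior = [0.5, 0.5]
models = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical B(o | theta)
mixture = bayes_mixture(prior, models)
candidates = [[0.5, 0.5], [0.6, 0.4], models[0], models[1]]
print(expected_kl(prior, models, mixture))
print([expected_kl(prior, models, c) for c in candidates])
```

Any other candidate predictor pays an extra cost equal to its own divergence from the mixture, which is why the mixture value is the smallest one printed.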
6.2 Evolutionary game theory

When dealing with adaptive agents, one of the most intriguing questions is what happens if two adaptive agents are coupled. Classic game theory does not really allow one to address that question, as it abstracts away from learning and adaptation processes and focuses on fixed-point conditions for equilibria. In contrast, evolutionary game theory focuses on the dynamics that can lead to equilibria [35]. One of the most widely studied dynamics in evolutionary game theory are the so-called replicator equations

\[
x_{t+1}^i = x_t^i \, \frac{f_i(x_t)}{\sum_j x_t^j f_j(x_t)},
\]

where $x_t^i$ represents the proportion of type $i$ in a population of individuals at time $t$. The vector $x_t = (x_t^1, \ldots, x_t^N)$ represents the population distribution, such that $\sum_i x_t^i = 1$. The function $f_i(x)$ denotes the fitness of type $i$, which depends on the population $x$. The proportion of individuals of type $i$ at the next time point $t + 1$ is determined by the fitness share this type achieves compared to the population total.

Interestingly, there is a formal correspondence between the replicator dynamics and Bayesian inference [36], [37]

\[
p(h \mid d) = \frac{p(h)\, p(d \mid h)}{\sum_{h'} p(h')\, p(d \mid h')},
\]

where $p(h)$ and $p(h \mid d)$ represent the prior and posterior probability mass allotted to hypothesis $h$. The likelihood function $p(d \mid h)$ works as a fitness landscape. The posterior probability is determined by the likelihood fitness achieved compared to the overall evidence $p(d) = \sum_{h'} p(h')\, p(d \mid h')$.
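The correspondence can be seen directly by writing both updates as code. In the sketch below the fitness of each type is a fixed number, which is a simplifying assumption (in general the fitness depends on the population); the function names and numerical values are hypothetical.

```python
from fractions import Fraction as F


def replicator_step(x, fitness):
    """One discrete replicator step with population-independent fitness."""
    mean_fitness = sum(xi * fi for xi, fi in zip(x, fitness))
    return [xi * fi / mean_fitness for xi, fi in zip(x, fitness)]


def bayes_update(prior, likelihood):
    """Posterior over hypotheses after observing one datum d."""
    evidence = sum(p * l for p, l in zip(prior, likelihood))
    return [p * l / evidence for p, l in zip(prior, likelihood)]


prior = [F(1, 2), F(1, 3), F(1, 6)]
likelihood = [F(1, 10), F(1, 2), F(9, 10)]  # p(d | h) for one observed d

print(bayes_update(prior, likelihood))
print(replicator_step(prior, likelihood))
```

Reading the prior as the population distribution and the likelihood as the fitness, the two functions perform the same normalization, with the evidence playing the role of the mean fitness.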

In evolutionary game theory the fixed points of the replicator dynamics have been studied extensively. In particular, evolutionarily stable strategies [56] have been shown to be a refinement of the common Nash equilibrium, in the sense that such Nash equilibria are stable with respect to perturbing mutant strategies. Since generalized Thompson sampling as described in (11) shares its form with Bayesian inference, the connection to evolutionary replicator equations is immediate. Therefore, we could apply very similar stability arguments in the case of two interacting adaptive agents (11) as previously applied in the case of the replicator dynamics. Generalized Thompson sampling might therefore also provide a useful tool in the future to study convergence of co-adaptation processes within the context of evolutionary game theory.



6.3 Causality 

To construct the Bayes-causal solution in Section 3.3 we needed to treat actions as interventions. This raises the question of why this distinction was not made when deriving classical SEU solutions. More formally, under what conditions do the equations in (11) reduce to

\[
P(a_t \mid a_{<t}, o_{<t}) = B(a_t \mid a_{<t}, o_{<t})
\]
\[
P(o_t \mid a_{\le t}, o_{<t}) = B(o_t \mid a_{\le t}, o_{<t}),
\]

that is, with no interventions? Since

\[
B(a_t \mid a_{<t}, o_{<t}) = \sum_\theta B(a_t \mid \theta, a_{<t}, o_{<t})\, B(\theta \mid a_{<t}, o_{<t})
\]
\[
B(o_t \mid a_{\le t}, o_{<t}) = \sum_\theta B(o_t \mid \theta, a_{\le t}, o_{<t})\, B(\theta \mid a_{\le t}, o_{<t}),
\]

determining the conditions boils down to analyzing when the equalities

\[
B(\theta \mid a_{<t}, o_{<t}) = B(\theta \mid \hat{a}_{<t}, o_{<t})
\]
\[
B(\theta \mid a_{\le t}, o_{<t}) = B(\theta \mid \hat{a}_{\le t}, o_{<t})
\]

hold. Replacing both sides yields

\[
\frac{B(\theta) \prod_{k=1}^{t-1} B(a_k \mid \theta, a_{<k}, o_{<k})\, B(o_k \mid \theta, a_{\le k}, o_{<k})}
     {\sum_{\theta'} B(\theta') \prod_{k=1}^{t-1} B(a_k \mid \theta', a_{<k}, o_{<k})\, B(o_k \mid \theta', a_{\le k}, o_{<k})}
=
\frac{B(\theta) \prod_{k=1}^{t-1} B(o_k \mid \theta, a_{\le k}, o_{<k})}
     {\sum_{\theta'} B(\theta') \prod_{k=1}^{t-1} B(o_k \mid \theta', a_{\le k}, o_{<k})}.
\tag{13}
\]

Inspecting (13) we conclude that

\[
B(a_k \mid \theta, a_{<k}, o_{<k}) = \delta_{a_k}(a_{<k}, o_{<k}),
\]

i.e. the actions have to be issued deterministically (but possibly history-dependent) from a unique policy. Intuitively speaking, this is because the operations of intervening and conditioning coincide when the random variables are deterministic.
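The role of the action terms can be checked on a toy example. The sketch below compares the posterior over two hypothetical models when actions are conditioned on (their likelihood terms are kept) and when they are treated as interventions (their terms are dropped). All names and numbers are illustrative assumptions, and policies and observation likelihoods are taken to be history-independent only to keep the example short.

```python
from fractions import Fraction as F

PRIOR = [F(1, 2), F(1, 2)]   # B(theta) for two hypothetical models
OBS_ON = [F(3, 4), F(1, 4)]  # B(o = 1 | theta), hypothetical values


def posterior(history, policies, actions_are_interventions):
    """Posterior over theta after a history of (action, observation) pairs.

    policies[theta][a] stands for B(a | theta).
    """
    weights = []
    for theta, weight in enumerate(PRIOR):
        for a, o in history:
            if not actions_are_interventions:
                weight *= policies[theta][a]  # conditioning keeps action terms
            weight *= OBS_ON[theta] if o == 1 else 1 - OBS_ON[theta]
        weights.append(weight)
    total = sum(weights)
    return [w / total for w in weights]


# One deterministic policy shared by both models versus model-specific policies.
shared_deterministic = [{0: F(1), 1: F(0)}, {0: F(1), 1: F(0)}]
model_specific = [{0: F(9, 10), 1: F(1, 10)}, {0: F(1, 10), 1: F(9, 10)}]
history = [(0, 1), (0, 1)]

for policies in (shared_deterministic, model_specific):
    print(posterior(history, policies, True), posterior(history, policies, False))
```

With the shared deterministic policy the two posteriors coincide, while with model-specific policies conditioning on the agent's own actions lets them count as spurious evidence about the environment.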

6.4 Open Problems 

There are important cases where random belief approaches can fail. Indeed, it is easy to devise experiments where an agent with policy uncertainty converges exponentially slower than one that knows the optimal policy (or does not converge at all).

Consider the following simple example: Environment 1 is a $k$-state MDP in which only $k$ consecutive actions $A$ reach a state with reward $+1$. Any interruption by a $B$-action leads back to the initial state. Environment 2 is like the first, but with actions $A$ and $B$ interchanged. The optimal policy figures out the true environment in $k$ actions (either $k$ consecutive $A$'s or $B$'s). Consider now an agent with random beliefs: the optimal action in environment 1 is $A$, in environment 2 it is $B$. A uniform $(\frac{1}{2}, \frac{1}{2})$ prior over the two possible environments stays a uniform posterior as long as no reward has been observed. Hence, an agent with random beliefs chooses $A$ and $B$ with equal probability at each time step. With this policy it takes about $2^k$ actions to accidentally choose a row of $A$'s (or $B$'s) of length $k$. From then on the agent acts optimally too. Thus, the optimal policy converges in time $k$, while the agent with policy uncertainty needs exponentially longer. A simple way to remedy this problem is, of course, to sample random beliefs only every $k$ time steps. But this problem can be exacerbated in non-stationary environments. Take, for instance, an increasing MDP with $k = \lceil 10\sqrt{t} \rceil$, in which the optimal policy converges in 100 steps, while an agent with policy uncertainty would not converge at all in most realizations.
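The exponential waiting time can be made exact for a simplified model of this example: in environment 1, an agent that picks $A$ with probability 1/2 at every step and is reset by every $B$. The sketch below solves the corresponding first-step recursion with exact fractions; the function name and the modeling simplification are assumptions made for illustration.

```python
from fractions import Fraction as F


def expected_steps_random_agent(k):
    """Expected number of actions until k consecutive A's occur.

    Let E(j) be the expected remaining steps after j consecutive A's, with
    E(k) = 0 and E(j) = 1 + E(j + 1) / 2 + E(0) / 2. Writing
    E(j) = a_j + b_j * E(0), the recursion is solved backwards from j = k.
    """
    a, b = F(0), F(0)  # coefficients of E(k) = 0
    for _ in range(k):
        a, b = 1 + a / 2, F(1, 2) + b / 2
    return a / (1 - b)  # solve E(0) = a_0 + b_0 * E(0)


for k in (1, 2, 5, 10):
    print(k, expected_steps_random_agent(k))
```

In this model the expected waiting time is $2^{k+1} - 2$, which grows on the order of $2^k$, whereas the optimal policy needs only $k$ actions.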

Although [11] proves asymptotic convergence for general environments fulfilling a restrictive form of ergodicity condition, this condition needs to be weakened for the convergence proof to be applicable to most real problems. But it is clear that some form of ergodicity is required for an agent with policy uncertainty to be able to learn to act optimally. Intuitively, this means that an agent can only learn if the environment has temporally stable statistical properties. Finally, determining the speed of convergence and the regret is currently an open problem.

7 Conclusion 

In this paper we have argued that policy uncertainty is a natural phenomenon 
that arises whenever there are not enough computational resources to apply 
the maximum SEU principle. We have shown that treating this uncertainty 
in a Bayesian way with actions as random variables that obey causal calculus 
naturally leads to Thompson sampling and its Bayesian generalization. This 
generalized Thompson sampling can be straightforwardly applied to evolution- 
ary game theory and to the problem of causal induction. As these random-belief 
approaches can be derived simply from probability theory and causal calculus 
we suggest that they should not be considered as mere heuristics but as well- 
founded principled approaches. 